Bringing R to Serverless Cloud Computing

R-Ladies DC Meetup

Erika Tyagi

2024-07-11

Who am I?

  • Lead Data Engineer at the Urban Institute.
  • Former data scientist, researcher, etc.
  • Passionate about developing user-friendly systems that make analyzing data more accessible, reproducible, and fun.

What will I talk about today?

  1. What is AWS Lambda and how is it useful?
  2. How can you run R from Lambda?
  3. What are use cases for integrating R and Lambda?

Motivation

  • Bridge the gap between two technologies that have (separately) led to innovation at the Urban Institute: R and AWS Lambda.
  • Make serverless cloud computing more accessible to researchers and data scientists.
  • Address “R can’t do…” pushback.

What is AWS Lambda? (without jargon)

A service from Amazon Web Services (AWS) that lets you run code in the cloud without having to manage servers.

What is AWS Lambda?

You can think of it like a kitchen in a restaurant.

  • You are the chef who only cares about preparing the food (or in this case, writing the code). You don’t have to worry about maintaining the kitchen, cleaning it, or even turning on the oven.
  • Lambda is the kitchen staff that takes care of all those tasks.
  • You just provide the recipe (your code), and Lambda executes it whenever a customer orders the dish (or when a specific event triggers your code).

What are common Lambda use cases?

  • Data processing: ETL, data validation, and data transformation
  • Web applications: APIs and microservices
  • Automation: Scheduled tasks, monitoring, and alerting
  • And much more!

How does Urban use Lambda?

What is AWS Lambda? (with jargon)

Focus on writing code, not managing infrastructure.

  • A serverless compute service from AWS that lets you run code in the cloud without having to manage servers.
  • You define a Lambda function to run code in a particular execution environment when triggered by an invocation event.
  • You only pay for what you use (based on the number of requests, allocated memory, and execution time).
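
To make the pay-for-what-you-use model concrete, here is a rough cost sketch in Python. The function name `estimate_monthly_cost` is hypothetical, the two default rates are illustrative assumptions (check the AWS pricing page for your region and architecture), and the free tier is ignored.

```python
def estimate_monthly_cost(
    requests: int,
    avg_duration_ms: float,
    memory_mb: int,
    price_per_million_requests: float = 0.20,   # assumed rate, USD
    price_per_gb_second: float = 0.0000166667,  # assumed rate, USD
) -> float:
    """Rough monthly Lambda cost in USD, ignoring the free tier."""
    # Compute charge: execution time (seconds) x allocated memory (GB).
    gb_seconds = requests * (avg_duration_ms / 1000) * (memory_mb / 1024)
    compute_cost = gb_seconds * price_per_gb_second
    # Request charge: a flat fee per invocation.
    request_cost = (requests / 1_000_000) * price_per_million_requests
    return round(compute_cost + request_cost, 2)


# Example: 100,000 invocations a month, 2 seconds each, 1,024 MB of memory.
print(estimate_monthly_cost(100_000, 2_000, 1_024))
```

Under these assumed rates the example comes to about 3.35 USD a month, most of it from execution time rather than request count.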

Why is Lambda useful?

  • Cost: It’s highly cost-effective for many use cases.
  • Ease of setup: It’s generally easier to set up and maintain than applications hosted on traditional servers.
  • Scalability: It quickly and automatically scales to meet demand.
  • Flexibility: It supports a variety of event sources and custom configurations, and is at the heart of an active developer community.

What are key limitations to Lambda?

  • Each function invocation has a maximum runtime of 15 minutes.
  • It has strict memory, storage, and concurrency constraints.
  • It only natively supports Java, Go, PowerShell, Node.js, C#, Python, and Ruby.
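
One possible way to live with the runtime limit is to check how much time is left and stop cleanly before the timeout. The sketch below is an assumption-laden illustration, not a recommended pattern from this talk: `process`, the `start` and `next_start` fields, the 10-second buffer, and `FakeContext` are all hypothetical. The only Lambda-specific call is `context.get_remaining_time_in_millis()`, which the Python runtime provides on the context object.

```python
SAFETY_BUFFER_MS = 10_000  # assumed margin to leave before the timeout


def process(item):
    """Hypothetical placeholder for one unit of real work."""
    return item * 2


def lambda_handler(event, context):
    items = event["items"]
    start = event.get("start", 0)  # hypothetical resume index
    results = []
    for index in range(start, len(items)):
        # Stop early if the invocation is about to hit its timeout.
        if context.get_remaining_time_in_millis() < SAFETY_BUFFER_MS:
            return {"done": False, "next_start": index, "results": results}
        results.append(process(items[index]))
    return {"done": True, "next_start": len(items), "results": results}


class FakeContext:
    """Local stand-in for the Lambda context object, for testing only."""

    def __init__(self, remaining_ms_sequence):
        self._remaining = iter(remaining_ms_sequence)

    def get_remaining_time_in_millis(self):
        return next(self._remaining)
```

A caller that receives `"done": False` could invoke the function again with `start` set to `next_start` to pick up where the previous run stopped.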

So how can I run R from Lambda?

Define a custom runtime (through a container image) with R.

  • Option 1: Use the rpy2 Python package.
  • Option 2: Use the lambdr R package.

Both options

Define a custom runtime (through a container image) with R.

  1. Build a custom container using a Dockerfile.
  2. Upload your container to AWS.
  3. Write your code.
  4. Deploy your Lambda function using the container image.
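
Step 4 can be scripted as well as done in the AWS console. The sketch below covers only that step and assumes the image has already been built and pushed to Amazon ECR (steps 1 and 2, typically done with the Docker CLI and not shown here). `build_create_function_args` and `deploy` are hypothetical helper names, and the default timeout and memory values are arbitrary; `create_function` with `PackageType="Image"` is the boto3 call for container-based functions.

```python
def build_create_function_args(function_name, image_uri, role_arn,
                               timeout_s=60, memory_mb=1024):
    """Arguments for boto3's Lambda create_function call (container image)."""
    if not 1 <= timeout_s <= 900:
        raise ValueError("Lambda timeouts must be between 1 and 900 seconds")
    return {
        "FunctionName": function_name,
        "PackageType": "Image",           # deploy from a container image
        "Code": {"ImageUri": image_uri},  # image previously pushed to ECR
        "Role": role_arn,                 # IAM execution role for the function
        "Timeout": timeout_s,
        "MemorySize": memory_mb,
    }


def deploy(function_name, image_uri, role_arn):
    """Create the function; requires boto3 and AWS credentials."""
    import boto3  # imported here so the helper above runs without boto3

    client = boto3.client("lambda")
    return client.create_function(
        **build_create_function_args(function_name, image_uri, role_arn)
    )
```

Calling `deploy` needs your own function name, ECR image URI, and IAM role ARN; none of these are given in the slides.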

Option 1: rpy2 (1/2)

  • Create a Dockerfile from an AWS Python Lambda image.
  • Install R, system dependencies, any additional R packages, and the rpy2 Python package.
FROM public.ecr.aws/lambda/python:3.10

ENV R_VERSION=4.3.1

RUN yum -y install wget git tar openssl-devel libxml2-devel \
  && yum -y install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm \
  && wget https://cdn.rstudio.com/r/centos-7/pkgs/R-${R_VERSION}-1-1.x86_64.rpm \
  && yum -y install R-${R_VERSION}-1-1.x86_64.rpm \
  && rm R-${R_VERSION}-1-1.x86_64.rpm \
  && yum -y clean all \
  && rm -rf /var/cache/yum

ENV PATH="${PATH}:/opt/R/${R_VERSION}/bin/" 
ENV LD_LIBRARY_PATH="/opt/R/${R_VERSION}/lib/R/lib/" 

RUN R -e "install.packages(c('aws.s3', 'dplyr'), \
  repos = c(CRAN = 'https://packagemanager.posit.co/cran/__linux__/centos7/latest'))"

COPY requirements.txt .
RUN pip3 install -r requirements.txt --target "${LAMBDA_TASK_ROOT}"

COPY . ${LAMBDA_TASK_ROOT} 

Option 1: rpy2 (2/2)

  • Write your R code.
parity <- function(number) {
    return (if (as.integer(number) %% 2 == 0) "even" else "odd")
}
  • From your Python code, use rpy2 to source and call your R code from the Lambda handler.
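from rpy2.robjects import r

def lambda_handler(event, context):
    # Read the input from the invocation event, e.g. {"number": 7}.
    number = event['number']
    # Load the R function definitions, then call parity() through rpy2.
    r('source("utils.R")')
    # The R result is a length-one character vector; return its only element.
    return r['parity'](number)[0]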
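
Once deployed, a function like this can be called with a small JSON event. The following is a minimal sketch, assuming a deployed function whose handler reads `event['number']` as above; `invoke_parity` is a hypothetical helper, and `"r-parity"` in the commented usage is a placeholder function name. `invoke` is the boto3 Lambda client method for synchronous calls.

```python
import json


def invoke_parity(client, function_name, number):
    """Invoke the deployed function and decode its JSON response."""
    response = client.invoke(
        FunctionName=function_name,
        Payload=json.dumps({"number": number}).encode("utf-8"),
    )
    # The response payload is a stream holding the handler's return value as JSON.
    return json.loads(response["Payload"].read())


# With real AWS credentials (boto3 required):
#   import boto3
#   print(invoke_parity(boto3.client("lambda"), "r-parity", 7))
```

Passing the client in as an argument keeps the helper easy to exercise locally with a stand-in object.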

Option 2: lambdr (1/2)

  • Create a Dockerfile from the AWS base Lambda image.
  • Install R, system dependencies, the lambdr R package, any additional R packages, and a bootstrap file.
FROM public.ecr.aws/lambda/provided

ENV R_VERSION=4.0.3
ENV R_SCRIPT=app.R 

RUN yum -y install wget git tar openssl-devel libxml2-devel \
  && yum -y install https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm \
  && wget https://cdn.rstudio.com/r/centos-7/pkgs/R-${R_VERSION}-1-1.x86_64.rpm \
  && yum -y install R-${R_VERSION}-1-1.x86_64.rpm \
  && rm R-${R_VERSION}-1-1.x86_64.rpm \
  && yum -y clean all \
  && rm -rf /var/cache/yum

ENV PATH="${PATH}:/opt/R/${R_VERSION}/bin/"

RUN R -e "install.packages(c('aws.s3', 'dplyr', 'lambdr'), repos = 'https://cloud.r-project.org/')"

RUN mkdir /lambda
COPY ${R_SCRIPT} /lambda
RUN chmod 755 -R /lambda

RUN printf '#!/bin/sh\ncd /lambda\nRscript ${R_SCRIPT}' > /var/runtime/bootstrap \
  && chmod +x /var/runtime/bootstrap

Option 2: lambdr (2/2)

  • Write your R code and define your handler.
  • From your R code, start the Lambda runtime.
parity <- function(number) {
    return (if (as.integer(number) %% 2 == 0) "even" else "odd")
}

lambdr::start_lambda()
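
Conceptually, a custom runtime is a loop that asks the Lambda Runtime API for the next event, calls the handler, and posts the result back. The Python sketch below illustrates that loop; it is not lambdr's implementation. `parity` here is a Python stand-in for the R handler above, the helper names are hypothetical, passing event keys as named arguments is an assumption chosen to mirror the R handler's `number` argument, and error reporting is omitted.

```python
import json
import os
import urllib.request

API_VERSION = "2018-06-01"


def parity(number):
    """Python stand-in for the R handler shown above."""
    return "even" if int(number) % 2 == 0 else "odd"


def http_get_next(base_url):
    """Fetch the next invocation event; blocks until one is available."""
    with urllib.request.urlopen(f"{base_url}/invocation/next") as resp:
        request_id = resp.headers["Lambda-Runtime-Aws-Request-Id"]
        return request_id, json.loads(resp.read())


def http_post_result(base_url, request_id, result):
    """Send the handler's result back for one invocation."""
    req = urllib.request.Request(
        f"{base_url}/invocation/{request_id}/response",
        data=json.dumps(result).encode("utf-8"),
        method="POST",
    )
    urllib.request.urlopen(req).close()


def handle_one(get_next, post_result, handler):
    """Process a single invocation: fetch event, call handler, post result."""
    request_id, event = get_next()
    result = handler(**event)  # event keys become named arguments
    post_result(request_id, result)
    return result


def run_forever():
    """Runtime loop; only meaningful inside a Lambda execution environment."""
    base = f"http://{os.environ['AWS_LAMBDA_RUNTIME_API']}/{API_VERSION}/runtime"
    while True:
        handle_one(
            lambda: http_get_next(base),
            lambda rid, res: http_post_result(base, rid, res),
            parity,
        )
```

Keeping the HTTP calls separate from `handle_one` means the fetch, call, and respond cycle can be tried locally with stand-in functions, without a Lambda environment.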

How has Urban used R with Lambda?

  • Automate data collection: Regularly check websites to download new data, perform checks, generate summary tables, and email researchers with links to the latest files.
  • Safely expand access to confidential data: Implement computationally intensive differential privacy algorithms at scale through an automated validation server prototype.

Thank you!